Idea

The idea behind the article was a documentary about the life of famous “con” artist Mr. Beltracchi link! I often heard saying: “Invest in fine Art, is a great investment!” in the media. Also, many private banks offer client services such as Art Consulting. The service offers advice on buying and selling fine art.

This analysis is an attempt to show evidence art investor might generate using the actual transaction data.

But how? Structured statistics, with the actual transactions, trends or patterns are not existent.

In order to find useful data, programmers are equipped with handy tools - especially, the web scraping techniques.

Approach

  1. Webscrape all the auctioned paintings from the SOGA auction house website (rvest, stringr)
  2. Tidy the raw data into the tidy form (tidyverse, rebus, purrr)
  3. Use tidy data as base for the analysis (tidyverse)
  4. Conclusions

Part I: The webscrape

The primary source of information is the website of Slovak national gallery auction house: https://www.soga.sk (aka SOGA). The website is filled with data that can be categorized as:

  • grouped & labeled auctions (since 2005)
  • all auctioned items are listed
  • artist, artwork name, starting price, final price data are listed

The content of the website will provide the fundamental dataset for this analysis.

First, let’s gather the data and build a database containing all the data available from the website.

But, before we even start web scraping process, lets check if the website allows this behavior. More information about using robot.txt here

library(robotstxt)
robotstxt::robotstxt("soga.sk") %>% print()

Great! The website enables the scraping of the data.

Next, lets identify the elements we want to scape. Selectorgadget identifies web elements containing auction information.

url <- "http://www.soga.sk/aukcie-obrazy-diela-umenie-starozitnosti/aukcie/vysledky-aukcii"

result <- url %>% read_html() %>%
        html_nodes("p.results") %>% 
        html_text()

time <- url %>% read_html() %>%
        html_nodes("p.about") %>% 
        html_text()

link <- url %>% read_html() %>%
        html_nodes("h2 a") %>% 
        html_attr("href") 

df <- cbind(time, result,link) %>% as.tibble()

This chunk of code records following table:

df %>% head() %>% knitr::kable()
time result link
138. zimná aukcia výtvarných diel a starožitností, 05.12.2017 Na aukcii bolo vydražených 159 diel (55%) v celkovej hodnote 499 745 €. /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/138-zimna-aukcia-vytvarnych-diel-a-starozitnosti
137. večerná aukcia, 14.11.2017 Na aukcii bolo vydražených 45 diel (50%) v celkovej hodnote 378 900 €. /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/137-vecerna-aukcia
136. jesenná aukcia výtvarných diel, 26.09.2017 Na aukcii bolo vydražených 152 diel (57%) v celkovej hodnote 198 940 €. /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/136-jesenna-aukcia-vytvarnych-diel
135. letná aukcia výtvarných diel a starožitností, 13.06.2017 Na aukcii bolo vydražených 164 diel (50%) v celkovej hodnote 276 625 €. /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/135-letna-aukcia-vytvarnych-diel-a-starozitnosti
134. večerná aukcia, 25.04.2017 Na aukcii bolo vydražených 47 diel (51%) v celkovej hodnote 665 600 €. /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/134-vecerna-aukcia
133. jarná aukcia výtvarných diel, 07.03.2017 Na aukcii bolo vydražených 129 diel (64%) v celkovej hodnote 212 230 €. /aukcie-obrazy-diela-umenie-starozitnosti/aukcie/133-jarna-aukcia-vytvarnych-diel

Perfect! The successful scrape generated the table with information about: * column time: name of the action & date of the auction * column result: # of auctioned items, % of sold items, auction turnover * column link: contains the weblink to the actual auction.

Unfortunately, the data are still locked in the records (strings). The Information is extraced using string manipulation (stringr, rebus) into separate columns:

  • date of the auction
  • number and name of the auction
  • number of artwork entering auction
  • success rate of the auctions
  • overall sum of all auctioned items
# extract time information 
ptrn <-  ", " %R% one_or_more(DIGIT) %R% "." %R% one_or_more(HEX_DIGIT) %R% "." %R% one_or_more(HEX_DIGIT) 
df$c_time <- df$time %>% str_extract(pattern = ptrn) %>% str_replace(", ","") %>% lubridate::dmy()

# extract name information 
df$c_name <- df$time %>%  str_replace(ptrn,"") 

# extract link information
df$c_link  <- paste0("http://www.soga.sk/",df$link)

# extract percentage information
ptrn_2 <-  "(" %R% one_or_more(DIGIT) %R% "%)" 
df$c_rate <- df$result %>% str_extract(pattern = ptrn_2) %>% str_replace("%","") %>% as.numeric()

# clean items information
ptrn_3 <-  one_or_more(DIGIT) %R% " diel" 
df$c_items <- df$result %>% str_extract(pattern = ptrn_3) %>% str_replace(" diel","") %>% as.numeric()

# clean price information
ptrn_4 <- one_or_more(DIGIT) %R% SPACE %R% one_or_more(DIGIT) %R% " €" 
df$c_price <- df$result %>% str_extract(pattern = ptrn_4) %>%
        str_replace(" ","") %>% 
        str_replace("€","") %>% 
        as.numeric()

# tidy dataframe & calc average price
df_tidy <- df %>%
        select(contains("c_")) %>%
        mutate(year = lubridate::year(c_time),
               quarter = lubridate::quarter(c_time),
               qtr = paste0(year,".",quarter),
               c_price_avg = c_price / c_items)


df_tidy %>% 
        select(-year,-qtr,-quarter) %>% 
        head() %>%
        DT::datatable()

Clean dataset reveals important information about auction house performance over the period of years 2001-2017.

df_summary <- function(data, ...) {
        
        group_var <- quos(...)
        
        data %>% 
                group_by(!!!group_var) %>%  
                summarise(s_price = sum(c_price),
                          s_items = sum(c_items),
                          c_price_avg = mean(c_price_avg)) 
        
}

# chart overview
df_tidy %>%
        df_summary(year) %>%
        rename(average_price = c_price_avg , 
               number_of_items = s_items,
               auction_turnover = s_price) %>% 
        gather(Ratio, Value, 2:length(.)) %>% 
        ggplot(aes(year, Value, group = Ratio)) +
        geom_line(alpha = .3)  +
        theme_minimal(base_family = "Verdana",base_size = 16) + 
        facet_wrap(~Ratio,scales = "free") +
        tidyquant::geom_ma(n = 4,color = "red", linetype = 1, size = 1) + 
        labs(title = "SOGA: Overview ",
             subtitle = "by indicator", caption = "source: www.soga.sk", y = "")

Daata shows SOGA turnover is growing steadily from 2004. Additionally, the average price per painting sold at auction is growing. Average price grew from ~1200€ to current ~5000€ in 14 years. This means either more expensive paintings are being sold or the prices of the artworks are growing. A calculation reveals 11% CAGR in prices per paiting for the timeperiod.

A byproduct of the scrape is the URL to every single auction in SOGA history (for example here http://www.soga.sk//aukcie-obrazy-diela-umenie-starozitnosti/aukcie/134-vecerna-aukcia). This link leads to every single item auctioned at that particular auction.

Next step is taken in order to: * scrape all the websites dedicated to listing items sold at one particular auction (user need to click several times next page in order to see all the items) * scrape the URLs of the particular items traded on the auction (download the individual URL link of the items) * scrape the details regarding starting price, final price, artwork name etc.

These steps can be executed using following code:

extract_auction_length      <- function(url) {
        ptrn <- "page=" %R% one_or_more(DIGIT)
        
        num <- url %>% read_html() %>%
                html_nodes("#auctionArtworks~ .pager .last") %>% 
                html_attr("href") %>% 
                str_extract(ptrn) %>% 
                str_replace("page=","") %>% 
                as.numeric()

        df <- paste0(url,"?page=")
        
        df_seq <- seq(1:num)
        
        df_final <- paste0(df,df_seq)
        return(df_final)
        
} # download the auction websites urls
extract_page_content        <- function(url) {
        print(url)
        
        Sys.sleep(sample(seq(1, 3, by=0.001), 1))
        
        article_list <- url %>% read_html() %>%
                html_nodes("h2") %>% 
                html_nodes("a") %>% 
                html_attr("href") 
        
        article_list_corrected <- paste0("http://www.soga.sk",
                                         article_list)
        
        return(article_list_corrected)
        
} # download the auction items urls
extract_article_content     <- function(url) {
        print(url)
        
        Sys.sleep(sample(seq(1, 3, by=0.001), 1))
        

        art_name <- url %>% read_html() %>% 
                html_nodes("h2 a") %>% 
                html_text("href")
        
        art_piece <- url %>% read_html() %>% 
                html_nodes(".wrapper a") %>% 
                html_text("href")
        
        art_cat <- url %>% read_html() %>% 
                html_nodes(".col1") %>% 
                html_text("href") %>% as.tibble()
        
        art_values <- url %>% read_html() %>% 
                html_nodes(".col2") %>% 
                html_text("href") %>% as.tibble()
        
        art_df <- bind_cols(art_cat,art_values) %>% as.tibble()
        
        art_df$name  <- art_name
        art_df$piece <- art_piece
        
        return(art_df)
        
        
} # download the auction item content

df_all_pages                <- map(df$c_link,extract_auction_length) %>% unlist()

df_all_pages_content        <- map(df_all_pages,extract_page_content) %>% unlist()

df_all_pages_content_vec    <- df_all_pages_content %>% pull()

df_all_article_content_sf   <- map(df_all_pages_content_vec,safely(extract_article_content))

Part II: Raw data into tidy data

The downloaded wide raw data have the following structure:

df_soga <- df_soga %>%
        rename(variable = value, value = value1) %>% 
        drop_na()  
str(df_soga)
## Classes 'tbl_df', 'tbl' and 'data.frame':    47635 obs. of  4 variables:
##  $ variable: chr  "Poradové číslo:" "Vyvolávacia cena:" "Konečná cena:" "Poradové číslo:" ...
##  $ value   : chr  "1" "400 €" "320 €" "2" ...
##  $ name    : chr  "CPIN ŠTEFAN (1919 - 1971)" "CPIN ŠTEFAN (1919 - 1971)" "CPIN ŠTEFAN (1919 - 1971)" "HÁLA JAN (1890 - 1959)" ...
##  $ piece   : chr  "Dievča" "Dievča" "Dievča" "Dievča na lúke (Leto)" ...

The variable column contains the following information:

  • Article Number (Poradové číslo)
  • Starting Price (Vyvolávacia cena)
  • Final Price (Konečná cena)

Importantly, everyday items (thing sculptures) are sold besides paintings. These items will be removed as they are not the target of this analysis.

For more broader audience I will translate the collum names into the English language.

df_soga$variable <- if_else(df_soga$variable == "Cena v predaji:", "sale_price",
                         if_else(df_soga$variable == "Konečná cena:", "final_price",
                          if_else(df_soga$variable == "Nevydražené", "unsold",
                          if_else(df_soga$variable == "Odhadovaná cena:", "assumed_price",
                          if_else(df_soga$variable == "Poradové číslo:", "item_number",
                          if_else(df_soga$variable == "Vyvolávacia cena:", "starting_price",df_soga$variable)))))) # works but looks ugly - sorry folks

The code above is five times nested if_else command - something I am not particularly proud of. It might be a more elegant way to code this but I haven’t found it yet.

Next step is to create an ID collumns that identify author & artwork and spread the data from long to wide - in line with the tidy data manifest written by Hadely Wickham.

df_soga_wide <- df_soga %>%
        mutate(id = paste(name, piece)) %>%  # create unique id per row
        split(.$id) %>%  # split the data per unique artwork
        map(safely(spread),variable, value) %>% # spread the values - use safely to prevent errors
        map_df("result") # filter list with suscessfull results 

glimpse(df_soga_wide)

Now, we have one row per piece and variables stored in columns. This enables a quick and consistent way of working the data. Let’s take a glimpse into the data and let’s find out the most frequently sold author at the auction house.

df_soga_wide$name %>%
        table() %>%
        as.tibble() %>%
        arrange(desc(n)) %>% 
        set_names(c("artist", "count")) %>% 
        top_n(10) %>% 
        DT::datatable()
## Selecting by count

Great! Now we see that the most frequently sold artist was a unknown central European artist with around 360x sold paintings.

Unfortunately, the data is still not tidy. My poorly written web scraping code is responsible for this messy dataset. (I am sure there is a better R code to scape data more accurately)

Now, lets tidy up columns containing the messy data. Columns are still character variables and errors prohibiting transformation into tidy are:

  • Columns with price data contains symbols currency symbol €
  • Blank space before, after and in between the values
  • Date of birth & death of the artist are not in separate columns
  • Name of the artist contains his living years

Let’s remove the errors using string manipulation techniques!

# tidy the columns with variables containing price informations

df_soga_tidy <- df_soga_wide %>%
        modify_at(c(5,6), ~ str_replace_all(.,"€","")) %>% # modify_at helps identify collums more efficiently then naming them using mutate
        modify_at(c(5,6), ~ str_replace_all(.," ","")) %>% 
        modify_at(c(5,6), ~ str_replace_all(.,"Neurčená","")) %>% 
        modify_at(c(5,6), ~ str_replace_all(.,",","")) %>% 
        modify_at(c(5,6), ~ str_replace_all(.,"EUR","")) %>% 
        modify_at(c(5,6), ~ str_trim(.)) %>% 
        modify_at(c(5,6), as.numeric)  

# tidy the column names 
colnames(df_soga_tidy) <- df_soga_tidy %>% 
                        colnames() %>%
                        str_to_lower() %>%
                        str_replace_all(":","") %>%
                        str_replace_all(" ","_")

# tidy the date of birth & death into two separate collumns
living_years <- str_extract_all(df_soga_tidy$name, one_or_more(DIGIT),simplify = T) %>% 
        as.tibble() %>% 
        set_names(c("birth_year","death_year","birth_year_I","death_year_I")) %>% 
        select("birth_year","death_year") %>% 
        modify_if(is.character, as.numeric)


df_soga_tidy <- df_soga_tidy %>% 
                bind_cols(living_years)

# remove years from name column
df_soga_tidy$name <- df_soga_tidy$name %>% 
        str_replace_all(pattern = DIGIT,"") %>% 
        str_replace_all("\\)"," ") %>%
        str_replace_all("\\(","") %>%
        str_replace_all("-","") %>% 
        str_replace_all("–","") %>% 
        str_trim() %>% 
        str_to_upper()

# tidy & include assumed_price in the final price
df_assumed_price_tidy <- df_soga_tidy$assumed_price %>% 
        str_replace_all("-", "/") %>% 
        str_replace_all("– ","/") %>% 
        str_replace_all("€","") %>% 
        str_replace_all("EUR","") %>% 
        str_replace_all(",00","") %>% 
        str_replace_all(",","") %>% 
        str_replace_all(space(),"") %>% 
        str_split("/",simplify = T) %>% 
        as.tibble() %>% 
        modify(as.numeric) %>% 
        mutate(assumed_price = (V1 + V2) / 2) %>% 
        select(assumed_price) 

df_soga_tidy <- bind_cols(dplyr::select(df_soga_tidy, -assumed_price),
                          df_assumed_price_tidy) 


df_soga_tidy %>% glimpse()
## Observations: 15,978
## Variables: 10
## $ name           <chr> "/JIŘÍ/ GEORG DOKOUPIL", "A D.WEGENER   ?", "A....
## $ piece          <chr> "Erotický motív", "Horská krajina", "Česká kraj...
## $ id             <chr> "/JIŘÍ/ GEORG DOKOUPIL (1954) Erotický motív", ...
## $ item_number    <chr> "154", "178", "291", "207", "201", "192", "214"...
## $ starting_price <dbl> 465, 929, 597, 996, 1062, 398, 664, 597, 830, 4...
## $ final_price    <dbl> NA, NA, NA, 1394, NA, NA, NA, 597, NA, NA, 1461...
## $ sale_price     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ birth_year     <dbl> 1954, 1891, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ death_year     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ assumed_price  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

Much better!!! With the above-written code we have achieved:

  • name more variable contains just the name of the artist
  • birth_year, death_year column created
  • assumed price now contains mean value of the range stated
  • numeric vector instead of character vector in the columns containing information about price

Since we have some tidy data now, let’s create some new ratios.

df_soga_tidy <- df_soga_tidy %>% 
        mutate(starting_price = if_else(is.na(starting_price), assumed_price, as.numeric(starting_price)), # if variable starting_price is NA use assumed_price instead 
                diff_price_rel = ((final_price - starting_price) / (starting_price)), # relative difference between starting price and final price
               diff_price_abs = final_price - starting_price ,  # absolute difference between starting price and final price
               bin_price = ntile(starting_price, 4),  # create bins by starting price 
               bin_diff_price_rel = ntile(diff_price_rel, 4)) # create bins by relative profit & losses

Part III: Analysis of the data

With tidy data lets focus on analyzing the data. This part, is always a cherry on the cake after the painful process of data acquisition & cleaning. (Jenny Bryans quote: “Behind every great plot there’s a great deal of wrangling”) - is a accurate one. Link here

First, let’s look at the highest prices paid for painting on auction:

df_soga_tidy %>% 
        drop_na(final_price) %>% 
        top_n(wt = final_price,n = 10) %>% 
        ggplot(aes(reorder(piece, final_price),final_price, fill = name)) +
        geom_col() +
        coord_flip() + 
        labs(title = "Highest price paid per paining",
             subtitle = "sorted by name of the artwork",
             caption = "Source: www.soga.sk") + 
        theme_minimal()  + 
        scale_y_continuous(labels = scales::comma) +
        theme(legend.text  = element_text(size = 6),
                legend.title = element_text(size = 6),legend.position = "bottom")

The highest auctioned painting was for the “Chlapec s Dalmatinom” by DOMENICHINO DOMENICO ZAMPIERI. Interestingly it failed to reach its owner price of 663 878 EUR and was sold at a loss of (-132 775 EUR or -19%) for final price of 531 103 EUR.

The second chart shows paintings sold at the highest relative profit:

df_soga_tidy %>% 
        drop_na(diff_price_rel) %>% 
        top_n(wt = diff_price_rel,n = 15) %>% 
        ggplot(aes(reorder(piece, diff_price_rel),diff_price_rel)) +
        geom_col() +
        coord_flip() + 
        theme_minimal() + 
        labs(title = "Highest profit generated by painting (in %)",
             subtitle = "relative, sorted by name of the artwork",
             y = "Profit (in %)",
             x = "Artwork",
             caption = "Source: www.soga.sk") + 
        scale_y_continuous(labels = scales::percent) 

Surprisingly, the highest profits were in the range of 10x - 15x fold return This is a great result for the investor. The most profitable painting was - “Zátišie s ľudovým džbánom” by KRIVOŠ RUDOLF ( link ). This painting was also an outlier as rest of KRIVOŠ paintings generated far lower returns.

Another point of view is to visualize the most profitable paintings in the absolute terms.

df_soga_tidy %>% 
        drop_na(diff_price_abs) %>% 
        top_n(wt = diff_price_abs,n = 15) %>% 
        ggplot(aes(reorder(piece, diff_price_abs),diff_price_abs)) +
        geom_col() +
        coord_flip() + 
        theme_minimal() + 
        labs(title = "Highest profit generated by painting (in €)",
             subtitle = "absolute, sorted by name of the artwork",
             y = "Profit (in %)",
             x = "Artwork",
             caption = "Source: www.soga.sk") + 
        scale_y_continuous(labels = scales::comma) 

Painting V žatve was the paiting sold at the highest profit of about 85 000 EUR. This is an equivalent of the price paid for small 1 bedroom flat in Bratislava.

Now, let’s visualize the result of the SOGA auctions:

df_loss_profit <- df_soga_tidy %>% 
        drop_na(final_price, bin_price)

library(plotly) 



plotly::plot_ly(df_loss_profit,
                x = ~starting_price,
                y = ~ final_price,
                color = ~ bin_diff_price_rel,
                text = ~paste("Name: ", name, "\n",
                              "Artwork:",piece,"\n",
                              "Starting Price: ", scales::comma(starting_price), "\n",
                              "Final Price:",scales::comma(final_price),"\n",
                              "Return (%):",scales::percent(diff_price_rel),"\n",
                              "Return (€):", scales::comma(diff_price_abs))) %>%
  layout(title = "Profits & Losses @ SOGA Auction House",
         xaxis = list(title = "Starting Price",range = c(0, 200000)),
         yaxis = list(title = "Starting Price",range = c(0, 200000)))

Now it is pretty clear that the outliers where price difference (between starting and final price) was big. This means some paintings fail to meet the expectation and was traded well below the starting price. Another observation - Profits are getting smaller the more we move up the price tag.

Finally, let’s make a summary statistics with the most profitable artist sold at the SOGA auction house.

df_soga_tidy %>% 
        drop_na(final_price,diff_price_rel,diff_price_abs) %>% 
        group_by(name) %>% 
        summarise(profit_relative = median(diff_price_rel, na.rm = T) %>% scales::percent(),
                  profit_absolute = median(diff_price_abs, na.rm = T) %>% scales::comma(),
                  artwork_price = median(final_price, na.rm = T) %>% scales::comma(),
                  count = n(),
                  sum = sum(diff_price_abs) %>% scales::comma()) %>% 
        filter(count > 5) %>% 
        arrange(desc(profit_relative)) %>% 
        dplyr::top_n(wt = profit_relative, n = 15) %>% 
        knitr::kable()
name profit_relative profit_absolute artwork_price count sum
LADISLAV GUDERNA 87.7% 2,257 4,149 7 11,284
EDITA SPANNEROVÁ 86.7% 249 713.5 8 2,754
BIZMAYER IGNÁC 85.6% 325 1,325 14 7,820
STRUHÁR ALOJZ 8.33% 25 310 6 200
MARTIN SEDLÁK 8.28% 66 1,062 7 631
ZBYNĚK PROKOP 76.5% 50.5 116.5 6 302
JÚLIUS JAKOBY 74.2% 3,071 7,800.5 6 16,000
FRANTIŠEK LIPTÁK 7.93% 100 746.5 12 1,661
VINCENT HLOŽNÍK 7.49% 132 1,461 19 14,736
DÁVID BAFFI 66% 49.5 99.5 6 397
JOZEF JANKOVIČ 64% 531 1,759 9 15,469
FRANTIŠEK REICHENTÁL 62.5% 132 1,162 9 11,021
KĽÚČIK PETER 60% 90 240 17 6,700
MILOŠ ALEXANDER BAZOVSKÝ 6.31% 67 3,485 21 26,457
KERN PETER JÚLIUS 5.56% 10 470 10 730

The results are filtered with the condition of the artist selling at least 5 paintings trough SOGA to limit the number of records and provide liquidity for the investor. Owning artwork of authors that are rarely trade is a liquidity risk for the investor and is recommended to avoid.

Conclusions

Known limitations